ROCm and HIP: A Detailed 10-Chapter Tutorial: The GPU Synchronicity Mindset Shift

The fundamental shift in GPU programming is from a CPU-centric, serial execution model to a decoupled producer-consumer model: the CPU feeds and manages the pipeline while the GPU executes queued work independently. The core realization is that the GPU is not meant to be driven as a strictly synchronous device; treating it as such creates a "stop-and-wait" bottleneck.

1. The Workflow Lifecycle

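In an asynchronous mindset, the developer does not wait for each task to finish. Memory allocation with hipMalloc is an ordinary blocking call, but once the buffers exist, the host-to-device copies, kernel launches, and device-to-host copies are placed as non-blocking requests into a stream, a queue that the GPU drains in order. The host synchronizes only at the point where it needs the results.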

2. Overcoming Stalls

When the host is forced to synchronize after every operation, the execution gap (the round-trip latency between CPU and GPU for each launch and wait) can dominate the run time, because the GPU sits idle between commands. With asynchronous submission, the CPU continues to work while the GPU processes its stream, which keeps both processors busy.

$$\text{Total Time} = \max(\text{CPU Work}, \text{GPU Work}) + \text{Sync Overhead}$$
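
As a rough worked example of this formula, assume purely illustrative figures (not measurements) of 8 ms of CPU work, 10 ms of GPU work, and 1 ms of synchronization overhead. If the two kinds of work are fully serialized, their times add; if they overlap perfectly, only the longer one counts:

$$\text{Serialized} \approx 8 + 10 + 1 = 19\ \text{ms}, \qquad \text{Overlapped} = \max(8, 10) + 1 = 11\ \text{ms}$$

Real workloads usually land between these two extremes, since overlap is rarely perfect.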

QUESTION 1

Which set of steps correctly converts a synchronous vector-add to use an explicit stream?

Call hipStreamCreate, use hipMemcpyAsync with the handle, and pass the handle as the 4th launch parameter of the kernel (inside <<<...>>>).

Call hipDeviceSynchronize after every kernel launch and use hipMemcpy.

Set the stream parameter to NULL in all hipMemcpyAsync calls.

Replace hipMalloc with hipHostMalloc exclusively.

QUESTION 2

Why is a GPU considered 'not meant to be driven as a strictly synchronous device'?

Because it has no internal clock.

Because waiting for the CPU to confirm every command leaves thousands of cores idle.

Because memory transfers cannot be tracked by the CPU.

Because the GPU must manage its own power state.

QUESTION 3

What is the primary risk of forcing the host to synchronize after every operation?

Memory corruption.

Host-side stalling and loss of hardware saturation.

Increased power consumption on the GPU.

Kernel compile errors.

QUESTION 4

In the logistics warehouse analogy, what does the 'Conveyor Belt' represent?

A HIP Stream.

The GPU Driver.

The CPU Cache.

The VRAM buffer.

QUESTION 5

True or False: hipMemcpyAsync returns control to the CPU before the data transfer is complete.

True

False

Case Study: The Warehouse Manager's Bottleneck

Asynchrony Implementation

A legacy ROCm application uses standard hipMemcpy and kernel launches without stream handles. The CPU utilization is 98%, but the GPU is only at 15% utilization because it waits for the CPU to finish logging data before starting the next copy.

Explain how asynchrony would fix this 'stop-and-wait' bottleneck.

Solution:
By using asynchrony, the CPU can enqueue the next data transfer and kernel launch to a HIP stream and immediately return to its logging tasks. This allows the GPU to process the stream in parallel with the CPU's logging, keeping the compute cores saturated.

Provide the code required to create a stream and launch a kernel into it (replacing a default launch).

Solution:

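#include <hip/hip_runtime.h>
hipStream_t myStream;
hipStreamCreate(&myStream);  // create an explicit, non-default stream
// Launch parameters: grid size, block size, dynamic shared memory bytes, stream.
myKernel<<<grid, block, 0, myStream>>>(args);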

What function must be called to ensure the data is fully copied back to the host before the CPU accesses it?

Solution:
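Call hipStreamSynchronize(myStream). It blocks the host until all work previously enqueued in that specific stream, including the device-to-host copy, has completed. This is the explicit 'handshake' after which the CPU can safely read the results. hipDeviceSynchronize would also work, but it waits on every stream rather than just this one.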